Data Preprocessing

1. Introduction

Data preprocessing is a crucial step in implementing Retrieval Augmented Generation (RAG) functionality, as it enables the transformation of raw data into a structured and contextualized format that can be effectively processed by Large Language Models (LLMs). This process involves a series of steps, including data merging, contextualizing, and preparation, to ensure that the input data is accurate, consistent, and suitable for RAG analysis. By preprocessing the data, we can unlock the full potential of LLMs and Watson Discovery, enabling more accurate and reliable results in RAG-related applications.

2. Input Data

  • Trial Balance spreadsheet that contains different accounts and the respective numbers.
  • Columns: Accounts - Description, 2020, Q1-2020, Q2-2020, etc.

3. Contextualizing Data for LLM and RAG

  • When a CSV file is uploaded to Watson Discovery, each row is transformed into a JSON object, allowing for efficient processing and analysis. However, for RAG use cases, the data can be scattered and fragmented, making it challenging for Large Language Models (LLMs) to process effectively.
  • To address this limitation, we developed a custom data pipeline that transforms all columns of data into a single paragraph of information. This pipeline enables the creation of a cohesive and structured input for LLMs, facilitating more accurate and reliable processing of RAG-related data.

4. Preparing Data for Watson Discovery

  • Saving data as CSV files as Watson Discovery separates each CSV row into independent documents that could then be used for RAG and LLM Search.

5. Conclusion

  • With spreadsheets as input data, it is important to clean and contextualize the data using a custom data pipeline. This will ensure the RAG technique to work properly and allow LLM to understand the input data better. Then, store data in Watson Discovery before proceeding with integrating with watsonx assistant and watsonx.ai.